The rapid global trend toward urbanization poses significant challenges for ensuring that rural populations are not left behind in development and policy considerations. Tracking vital statistics for these populations is increasingly difficult, leading to a shortage of reliable, up-to-date data. Private data, such as social media data, has shown the potential to complement or supplement the existing public data. This study explores using Facebook Advertising data, which offers a digital “census” of over two billion users, to measure and analyze rural-urban inequalities.
This vignette utilizes rSocialWatcher (Marty, 2020) to query Facebook Marketing data to understand divides between rural and urban behaviors in Sub-Saharan Africa. Facebook data is utilized to analyze statistically significant differences between chosen behaviors in urban and rural regions in Nigeria. This vignette is based on Facebook Ads as a Demographic Tool to Measure the Urban-Rural Divide (Daniele Rama et al., 2020) which similarly uses Facebook Marketing data to study the urban-rural divide in Italy.
## Load packages
library(rsocialwatcher)
library(knitr)
library(htmlwidgets)
library(htmltools)
library(jsonlite)
library(stringr)
library(dplyr)
library(tidyr)
library(viridis)
library(RColorBrewer)
library(ggplot2)
library(leaflet)
library(kableExtra)
library(ggpubr)
library(WDI)
library(sf)
library(raster)
library(exactextractr)
library(rnaturalearth)
library(rnaturalearthdata)
TOKEN <- "Enter API Token Here"
VERSION <- "Enter Version Here"
CREATION_ACT <- "Enter Creation Act Here"
To thoroughly explore and understand the prevalence of Facebook usage across Sub-Saharan Africa, it is essential to analyze the adult population data for each country within the region. The total adult population data can be obtained from the World Development Indicators (WDI). We will then query the total adult Facebook population for each country to calculate the percentage of Facebook users per country. This approach will provide a comprehensive view of Facebook’s penetration among adults in Sub-Saharan Africa, allowing us to draw meaningful insights and make informed decisions based on social media usage trends in the region.
sub_saharan_countries <- c("AGO", "BEN", "BWA", "BFA", "BDI", "CPV", "CMR", "CAF", "TCD",
"COM", "COG", "CIV", "COD", "DJI", "GNQ", "ERI", "SWZ", "ETH",
"GAB", "GMB", "GHA", "GIN", "GNB", "KEN", "LSO", "LBR", "MDG",
"MWI", "MLI", "MRT", "MUS", "MOZ", "NAM", "NER", "NGA", "RWA",
"STP", "SEN", "SYC", "SLE", "SOM", "ZAF", "SSD", "TZA", "TGO",
"UGA", "ZMB", "ZWE")
population_data <- WDI(country = sub_saharan_countries, indicator = "SP.POP.1564.TO", start = 2021, end = 2021)
#> Warning in open.connection(con, "rb"): URL
#> 'https://api.worldbank.org/v2/en/country/AGO;BEN;BWA;BFA;BDI;CPV;CMR;CAF;TCD;COM;COG;CIV;COD;DJI;GNQ;ERI;SWZ;ETH;GAB;GMB;GHA;GIN;GNB;KEN;LSO;LBR;MDG;MWI;MLI;MRT;MUS;MOZ;NAM;NER;NGA;RWA;STP;SEN;SYC;SLE;SOM;ZAF;SSD;TZA;TGO;UGA;ZMB;ZWE/indicator/SP.POP.1564.TO?format=json&date=2021:2021&per_page=32500&page=9':
#> Timeout of 60 seconds was reached
head(population_data,10) %>% kable
| country | iso2c | iso3c | year | SP.POP.1564.TO |
|---|---|---|---|---|
| Angola | AO | AGO | 2021 | 18020088 |
| Burundi | BI | BDI | 2021 | 6430031 |
| Benin | BJ | BEN | 2021 | 7063242 |
| Burkina Faso | BF | BFA | 2021 | 11793330 |
| Botswana | BW | BWA | 2021 | 1643461 |
| Central African Republic | CF | CAF | 2021 | 2691324 |
| Cote d’Ivoire | CI | CIV | 2021 | 15332583 |
| Cameroon | CM | CMR | 2021 | 14923302 |
| Congo, Dem. Rep. | CD | COD | 2021 | 48438607 |
| Congo, Rep. | CG | COG | 2021 | 3263441 |
To query the data of average Facebook users by country, we use rSocialWatcher to pull the data from the Facebook Marketplace API. The data is broken down into estimates of daily active users (DAU) and lower and upper bounds for monthly active users (MAU).
AF_data <- query_fb_marketing_api(
location_unit_type = "country",
location_keys = map_param_vec(population_data$iso2c),
version = VERSION,
creation_act = CREATION_ACT,
token = TOKEN)
To determine the percentage of the adult population on Facebook, the mean of the lower and upper bounds of the monthly average users is divided by the adult population data obtained from the WDI. Due to Facebook’s age restrictions, only data for populations aged 18 and above is queried.
AF_data_filtered <- AF_data %>%
inner_join(population_data, by = c("location_keys" = "iso2c")) %>%
mutate(percent_pop = (estimate_mau_lower_bound + estimate_mau_upper_bound)/2 / SP.POP.1564.TO)
africa_sf <- ne_countries(continent = "Africa", returnclass = "sf")
africa_sf <- africa_sf %>%
inner_join(AF_data_filtered, by = c("iso_a3" = "iso3c"))
Below, the graph illustrates the percentage of the adult Facebook users across various Sub-Saharan Africa countries. Some notable observations include:
africa_sf$prcnt_p <- africa_sf$prcnt_p * 100
palette1 <- colorNumeric(
palette = "YlOrRd",
domain = africa_sf$prcnt_p
)
palette2 <- colorNumeric(
palette = "YlOrRd",
domain = africa_sf$SP_POP_
)
leaflet(height = 500) %>%
addTiles() %>%
addPolygons(data = africa_sf,
popup = ~paste("<strong>Country:</strong> ", country, "<br>",
"<strong>Facebook Percentage:</strong> ", paste(round(prcnt_p, 2),"%", sep=""), "<br>",
"<strong>Adult Population:</strong> ", round(SP_POP_/1000000, 2), " million"),
fillColor = ~palette2(SP_POP_),
color = "black",
weight = 1,
opacity = 1,
fillOpacity = 0.7,
group = "Adult Population") %>%
addPolygons(data = africa_sf,
popup = ~paste("<strong>Country:</strong> ", country, "<br>",
"<strong>Facebook Percentage:</strong> ", paste(round(prcnt_p, 2),"%", sep=""), "<br>",
"<strong>Adult Population:</strong> ", round(SP_POP_/1000000, 2), " million"),
fillColor = ~palette1(prcnt_p),
color = "black",
weight = 1,
opacity = 1,
fillOpacity = 0.7,
group = "Facebook Percentage") %>%
addLayersControl(
baseGroups = c("Adult Population", "Facebook Percentage"),
options = layersControlOptions(collapsed = FALSE, autoZIndex = FALSE)
) %>%
addLegend(position = "bottomright", pal = palette2, values = africa_sf$SP_POP_,
title = "Adult Population (millions)", className = "custom-legend legend1") %>%
addLegend(position = "bottomright", pal = palette1, values = africa_sf$prcnt_p,
title = "Facebook Percentage", className = "custom-legend legend2") %>%
onRender("
function(el, x) {
var map = this;
var legend1 = document.querySelector('.legend1');
var legend2 = document.querySelector('.legend2');
function updateLegends(e) {
if (e.name === 'Adult Population') {
legend1.style.display = 'block';
legend2.style.display = 'none';
} else if (e.name === 'Facebook Percentage') {
legend1.style.display = 'none';
legend2.style.display = 'block';
}
}
map.on('baselayerchange', updateLegends);
legend1.style.display = 'block';
legend2.style.display = 'none';
}
") %>%
tagList(
tags$style(HTML("
.custom-legend {
background-color: white;
padding: 10px;
border: 1px solid black;
border-radius: 5px;
box-shadow: 2px 2px 6px rgba(0,0,0,0.3);
}
"))
)
We needed enough regions to have statistically significant results and Facebook did not include subregion data for most of the countries. Nigeria is selected for three reasons:
Therefore, we need a country that has enough regions to support analysis. For example, South Africa only has 9 regions, which is not enough for useful analysis.
The names are adjusted to match the syntax and spelling of WDI data.
NG_fb_df <- get_fb_parameter_ids(type = "region",
country_code = "NG",
version = VERSION,
token = TOKEN)
NG_fb_df$name <- gsub("\\b state\\b", "", NG_fb_df$name, ignore.case = TRUE)
NG_fb_df$name <- str_replace_all(NG_fb_df$name, regex("Benué", ignore_case = TRUE), "Benue")
NG_fb_df$name <- str_replace_all(NG_fb_df$name, regex("Nasarawa", ignore_case = TRUE), "Nassarawa")
To obtain population data for each individual region, we use population raster data from WorldPop. Using a 1km resolution map and aggregate the data by region, we describe region population with multiple metric. This detailed approach allows for a precise and comprehensive analysis of population distribution and density across different regions.
population_raster <- raster("worldpop_data/nga_ppp_2020.tif")
population_df <- as.data.frame(population_raster, xy=TRUE)
population_df <- population_df[!is.na(population_df$nga_ppp_2020), ]
population_df$log_population <- log1p(population_df$nga_ppp_2020)
To validate our data, we extract the coordinates of locations in Nigeria with populations exceeding 1 million people and plot them on a map.
world_cities <- ne_download(scale = "medium", type = "populated_places", category = "cultural", returnclass = "sf")
#> Reading layer `ne_50m_populated_places' from data source
#> `/private/var/folders/pl/z72m_lzx4_d8qgb1jjgg6tgc0000gn/T/RtmpWFRKNr/ne_50m_populated_places.shp'
#> using driver `ESRI Shapefile'
#> Simple feature collection with 1251 features and 137 fields
#> Geometry type: POINT
#> Dimension: XY
#> Bounding box: xmin: -175.2206 ymin: -90 xmax: 179.2166 ymax: 78.22097
#> Geodetic CRS: WGS 84
nigeria_cities <- world_cities %>% filter(ADM0NAME == "Nigeria")
largest_nigeria_cities <- nigeria_cities %>% filter(POP_MAX > 1000000)
nigeria <- ne_states(country = "Nigeria", returnclass = "sf")
largest_nigeria_cities %>%
dplyr::select(NAME, POP_MAX, ADM1NAME) %>%
kable
| NAME | POP_MAX | ADM1NAME | geometry |
|---|---|---|---|
| Benin City | 1190000 | Edo | POINT (5.618062 6.342423) |
| Port Harcourt | 1020000 | Rivers | POINT (7.008055 4.811948) |
| Ibadan | 2628000 | Oyo | POINT (3.928036 7.381972) |
| Kaduna | 1442000 | Kaduna | POINT (7.438054 10.52196) |
| Abuja | 1576000 | Federal Capital Territory | POINT (7.489505 9.05462) |
| Kano | 3140000 | Kano | POINT (8.518092 12.00192) |
| Lagos | 9466000 | Lagos | POINT (3.389585 6.445208) |
ggplot() +
geom_raster(data = population_df, mapping = aes(x = x, y = y, fill = log_population)) +
scale_fill_gradient(low = "darkblue", high = "lightblue", na.value = "transparent") +
labs(title = "Population Density (1km) with Large Cities",
fill = "Log(Population)") +
geom_sf(data = nigeria, fill = NA, color = "black", linewidth = 0.5) +
geom_sf(data = largest_nigeria_cities, aes(geometry = geometry), fill = "white", color = "white", size = 4, shape = 2, stroke = 1.2) +
geom_text(data = largest_nigeria_cities, aes(x = st_coordinates(geometry)[,1], y = st_coordinates(geometry)[,2], label = NAME),
size = 3, nudge_y = 0.54, nudge_x = 0.1, color = "white") +
theme_minimal() +
theme(
panel.background = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.line = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank(),
axis.title = element_blank(),
plot.background = element_blank(),
plot.title = element_text(hjust = 0.5, size = 16, face = "bold")
)
To determine which regions to classify as urban, we evaluate several metrics for each region:
percentile_95 <- function(values, coverage_fractions) {
quantile(values, probs = 0.99, na.rm = TRUE)
}
nigeria_regions <- st_transform(nigeria, crs(population_raster))
total_population <- exact_extract(population_raster, nigeria_regions, fun = 'sum', progress = F)
mean_population <- exact_extract(population_raster, nigeria_regions, fun = 'mean', progress = F)
percentile_95_values <- exact_extract(population_raster, nigeria_regions, fun = percentile_95, progress = F)
nigeria_regions$pop_percentile <- percentile_95_values
nigeria_regions$total_population <- total_population
nigeria_regions$mean_population <- mean_population
nigeria_regions$log_pop_percentile = log1p(nigeria_regions$pop_percentile)
nigeria_regions$log_total_population = log1p(nigeria_regions$total_population)
nigeria_regions$log_mean_population = log1p(nigeria_regions$mean_population)
nigeria_filtered_regions <- nigeria_regions %>%
dplyr::select(name, pop_percentile, total_population, mean_population, log_pop_percentile, log_total_population, log_mean_population)
generate_continuous_legend_html <- function(title, palette, values) {
bins <- pretty(values, n = 7)
colors <- palette(bins)
labels <- paste(
"<i style='background:", colors, "; width: 18px; height: 18px; display: inline-block;'></i>",
ifelse(bins == min(bins), "Low", ifelse(bins == max(bins), "High", "")),
collapse = "<br>"
)
paste("<h4>", title, "</h4>", labels)
}
ttl_ppl_range <- range(nigeria_filtered_regions$ttl_ppl, na.rm = TRUE)
mn_pplt_range <- range(nigeria_filtered_regions$mn_pplt, na.rm = TRUE)
pp_prcn_range <- range(nigeria_filtered_regions$pp_prcn, na.rm = TRUE)
lg_ttl_range <- range(nigeria_filtered_regions$lg_ttl_, na.rm = TRUE)
lg_mn_p_range <- range(nigeria_filtered_regions$lg_mn_p, na.rm = TRUE)
lg_pp_p_range <- range(nigeria_filtered_regions$lg_pp_p, na.rm = TRUE)
pop_palette <- colorNumeric("YlOrRd", ttl_ppl_range, na.color = "transparent")
mean_palette <- colorNumeric("YlOrRd", mn_pplt_range, na.color = "transparent")
percentile_palette <- colorNumeric("YlOrRd", pp_prcn_range, na.color = "transparent")
log_pop_palette <- colorNumeric("YlOrRd", lg_ttl_range, na.color = "transparent")
log_mean_palette <- colorNumeric("YlOrRd", lg_mn_p_range, na.color = "transparent")
log_percentile_palette <- colorNumeric("YlOrRd", lg_pp_p_range, na.color = "transparent")
legend_html <- generate_continuous_legend_html("Value", pop_palette, ttl_ppl_range)
#> Warning in palette(bins): Some values were outside the color scale and will be
#> treated as NA
combined_map <- leaflet(height = 700, width = 900) %>%
addTiles() %>%
addPolygons(data = nigeria_filtered_regions,
fillColor = ~pop_palette(ttl_ppl),
color = "black", weight = 1, opacity = 1, fillOpacity = 0.7,
popup = ~paste("<strong>Region:</strong> ", name, "<br>",
"<strong>Metric:</strong> ", round(ttl_ppl, 2)),
group = "Population") %>%
addPolygons(data = nigeria_filtered_regions,
fillColor = ~mean_palette(mn_pplt),
color = "black", weight = 1, opacity = 1, fillOpacity = 0.7,
popup = ~paste("<strong>Region:</strong> ", name, "<br>",
"<strong>Metric:</strong> ", round(mn_pplt, 2)),
group = "Mean Pop") %>%
addPolygons(data = nigeria_filtered_regions,
fillColor = ~percentile_palette(pp_prcn),
color = "black", weight = 1, opacity = 1, fillOpacity = 0.7,
popup = ~paste("<strong>Region:</strong> ", name, "<br>",
"<strong>Metric:</strong> ", round(pp_prcn, 2)),
group = "95% Pop") %>%
addPolygons(data = nigeria_filtered_regions,
fillColor = ~log_pop_palette(lg_ttl_),
color = "black", weight = 1, opacity = 1, fillOpacity = 0.7,
popup = ~paste("<strong>Region:</strong> ", name, "<br>",
"<strong>Metric:</strong> ", round(lg_ttl_, 2)),
group = "Pop - Log") %>%
addPolygons(data = nigeria_filtered_regions,
fillColor = ~log_mean_palette(lg_mn_p),
color = "black", weight = 1, opacity = 1, fillOpacity = 0.7,
popup = ~paste("<strong>Region:</strong> ", name, "<br>",
"<strong>Metric:</strong> ", round(lg_mn_p, 2)),
group = "Mean Pop - Log") %>%
addPolygons(data = nigeria_filtered_regions,
fillColor = ~log_percentile_palette(lg_pp_p),
color = "black", weight = 1, opacity = 1, fillOpacity = 0.7,
popup = ~paste("<strong>Region:</strong> ", name, "<br>",
"<strong>Metric:</strong> ", round(lg_pp_p, 2)),
group = "95% Pop - Log") %>%
addLayersControl(
baseGroups = c("Population", "Mean Pop", "95% Pop",
"Pop - Log", "Mean Pop - Log",
"95% Pop - Log"),
options = layersControlOptions(collapsed = FALSE)
) %>%
addControl(html = legend_html, position = "bottomright") %>%
htmlwidgets::onRender("
function(el, x) {
var map = this;
function updateLegend(e) {
var layer = e.name;
var legendHTML = document.querySelector('.custom-legend');
legendHTML.innerHTML = legend_htmls[layer];
}
map.on('baselayerchange', updateLegend);
// Initialize legend display
updateLegend({name: 'Population'});
}
") %>%
tagList(
tags$style(HTML("
.custom-legend {
background-color: white;
padding: 10px;
border: 1px solid black;
border-radius: 5px;
box-shadow: 2px 2px 6px rgba(0,0,0,0.3);
font-size: 12px;
line-height: 18px;
}
.leaflet-bottom.leaflet-right .custom-legend {
margin-bottom: 20px;
}
"))
)
combined_map
The log of the 95th percentile produces the most interesting and useful results for clustering. Applying the log transformations is necessary because Lagos is much more densely and highly populated than the rest of Nigeria. This transformation helps normalize the data, making it easier to identify and analyze patterns in population distribution across different regions.
ggplot(nigeria_filtered_regions) +
geom_sf(aes(fill = lg_pp_p)) +
scale_fill_distiller(palette = "YlOrRd", direction = 1) +
labs(title = "95% Population (1km) - Log Scale",
fill = "Population") +
theme_minimal() +
theme(
panel.background = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.line = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank(),
axis.title = element_blank(),
plot.background = element_blank(),
plot.title = element_text(hjust = 0.5, size = 16, face = "bold")
)
To classify regions as urban or rural, we use k-means clustering. The elbow plot shows a clear bend at two clusters, supporting this methodology. This approach allows us to effectively differentiate between urban and rural areas based on population metrics, ensuring accurate and meaningful classification.
metric <- nigeria_filtered_regions$lg_pp_p
#metric <- nigeria_filtered_regions$log_pop_percentile
wss <- sapply(1:10, function(k){
kmeans(metric, centers = k, nstart = 25)$tot.withinss
})
wss_df <- data.frame(
clusters = 1:10,
wss = wss
)
ggplot(wss_df, aes(x = clusters, y = wss)) +
geom_point() +
geom_line() +
labs(title = "Elbow Method for Determining Optimal Number of Clusters",
x = "Number of Clusters",
y = "Total Within-Cluster Sum of Squares (WSS)") +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, size = 16, face = "bold")
)
After applying k-means clustering with two clusters, the regions are categorized as follows. The results align with expectations: regions containing the largest cities (Lagos, Abuja, Kano) are classified as urban, along with the more densely populated coastal areas. An exception is Ibadan. Although Ibadan is one of the largest cities in Nigeria, the Oyo region is not densely populated, thus it falls outside the urban classification.
set.seed(42)
kmeans_result <- kmeans(metric, centers = 2, nstart = 25)
nigeria_filtered_regions$Cluster <- as.factor(kmeans_result$cluster)
nigeria_filtered_regions_sf <- st_as_sf(nigeria_filtered_regions)
yl_or_rd_colors <- brewer.pal(3, "YlOrRd")[1:2]
cluster_labels <- c("1" = "Urban", "2" = "Rural")
ggplot(nigeria_filtered_regions_sf) +
geom_sf(aes(fill = Cluster)) +
labs(title = "K-means Clustering on Log Pop Population",
fill = "Cluster") +
geom_sf(data = largest_nigeria_cities, aes(geometry = geometry),
fill = "black", color = "black", size = 4, shape = 2, stroke = 1.2) +
geom_text(data = largest_nigeria_cities,
aes(x = st_coordinates(geometry)[,1],
y = st_coordinates(geometry)[,2],
label = NAME),
size = 3, nudge_y = 0.5, nudge_x = 0.1, color = "black") +
scale_fill_manual(values = yl_or_rd_colors, labels = cluster_labels) +
theme_minimal() +
theme(
panel.background = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.line = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank(),
axis.title = element_blank(),
plot.background = element_blank(),
plot.title = element_text(hjust = 0.5, size = 16, face = "bold")
)
For reference, the distribution of regions based on their log 95th percentile population is displayed, showing how the regions are grouped. Despite the log transformation, Lagos remains distinctly separate from other regions, indicating its uniquely high population density.
plot_data <- data.frame(
x = nigeria_filtered_regions$lg_pp_p,
cluster = nigeria_filtered_regions$Cluster
)
custom_colors <- c("blue", "orange")
ggplot(plot_data, aes(x = x, y = 0, color = cluster)) +
geom_point(size = 4) +
annotate("segment", x = min(plot_data$x), xend = max(plot_data$x), y = 0, yend = 0, size = 1) +
annotate("segment", x = min(plot_data$x), xend = min(plot_data$x), y = -0.1, yend = 0.1, size = 1) +
annotate("segment", x = max(plot_data$x), xend = max(plot_data$x), y = -0.1, yend = 0.1, size = 1) +
geom_text(aes(label = round(x, 1)), col = "white", vjust = -1.5, size = 3) +
scale_x_continuous(limits = range(plot_data$x)) +
scale_y_continuous(limits = c(-1, 1)) +
scale_color_manual(values = custom_colors) +
theme(panel.background = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank(),
axis.title = element_blank(),
legend.position = "none") +
labs(title = "K-means Clustering on Total Population")
#> Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
#> ℹ Please use `linewidth` instead.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
We select the following behaviors and interests to study as selected by Rama et Al (2020).
Selected Behaviors
Selected Interests
For each of the selected interests and behaviors, we need to acquire the correct keys to pull the accurate data. This ensures that our queries retrieve the specific information required for our analysis.
interests_df <- get_fb_parameter_ids("interests", VERSION, TOKEN)
#> No encoding supplied: defaulting to UTF-8.
#> Warning in get_fb_parameter_ids("interests", VERSION, TOKEN): Error code: 190
#> Warning in get_fb_parameter_ids("interests", VERSION, TOKEN): Error validating
#> access token: Session has expired on Friday, 05-Jul-24 06:57:04 PDT. The
#> current time is Friday, 02-Aug-24 05:27:46 PDT.
behaviors_df <- get_fb_parameter_ids("behaviors", VERSION, TOKEN)
#> No encoding supplied: defaulting to UTF-8.
#> Warning in get_fb_parameter_ids("behaviors", VERSION, TOKEN): Error code: 190
#> Warning in get_fb_parameter_ids("behaviors", VERSION, TOKEN): Error validating
#> access token: Session has expired on Friday, 05-Jul-24 06:57:04 PDT. The
#> current time is Friday, 02-Aug-24 05:27:46 PDT.
selected_behaviors <- c("Frequent Travelers",
"Frequent international travelers",
"Technology early adopters",
"Facebook access (mobile): Apple (iOS) devices",
"Lives abroad",
"Facebook access (network type): 4G",
"Facebook access (network type): 3G",
"Facebook access (mobile): Android devices")
selected_interests <- c("Gambling (gambling)",
#"Organic food (food & drink)",
"Concerts (music event)",
"Fast food restaurants (dining)",
#"Tattoos (body art)",
#"Reading (communication)",
"Restaurants (dining)",
#"Fitness and wellness (fitness)",
"Cooking (food & drink)",
#"Vegetarianism (diets)",
#"Jazz music (music)",
#"Hunting (sport)",
#"Fine art (visual art)",
"Politics (politics)")
behaviors_id <- behaviors_df %>%
filter(name %in% selected_behaviors) %>%
pull(id)
interests_id <- interests_df %>%
filter(name %in% selected_interests) %>%
pull(id)
We query each behavior and interests across the regions and save the data locally for further analysis and reference. Additionally, we also study gender data for each region. Gender is analyzed separately from behavior and interests because it was included as a distinct variable in the original study.
## Behaviors and Interests
for (behavior in behaviors_id) {
fb_df <- query_fb_marketing_api(
location_unit_type = "region",
location_keys = map_param_vec(NG_fb_df$key),
behaviors = behavior,
version = VERSION,
creation_act = CREATION_ACT,
token = TOKEN
)
write.csv(fb_df, file = paste0("data/nigeria/", behavior, ".csv"))
print(paste("Done with", behavior))
}
for (interest in interests_id) {
fb_df <- query_fb_marketing_api(
location_unit_type = "region",
location_keys = map_param_vec(NG_fb_df$key),
interests = interest,
version = c(VERSION, VERSION1),
creation_act = c(CREATION_ACT, CREATION_ACT1),
token = c(TOKEN, TOKEN1)
)
write.csv(fb_df, file = paste0("data/nigeria/", interest, ".csv"))
print(paste("Done with", interest))
}
## Gender
fb_df <- query_fb_marketing_api(
location_unit_type = "region",
location_keys = map_param_vec(NG_fb_df$key),
gender = map_param(1, 2),
version = VERSION,
creation_act = CREATION_ACT,
token = TOKEN
)
male <- fb_df %>%
filter(gender == 1)
write.csv(male, file = paste0("data/nigeria/male.csv"))
female <- fb_df %>%
filter(gender == 2)
write.csv(female, file = paste0("data/nigeria/female.csv"))
To compare the percentage of Facebook users in each region associated with a behavior, gender, or interest, we need to query the region-level Facebook population data. This data will provide the necessary baseline for calculating and comparing user engagement across different demographics and interests.
NG_fb_region_data <- query_fb_marketing_api(
location_unit_type = "region",
location_keys = map_param_vec(NG_fb_df$key),
version = VERSION,
creation_act = CREATION_ACT,
token = TOKEN)
NG_fb_region_data = NG_fb_region_data %>%
mutate(users = (estimate_mau_lower_bound + estimate_mau_upper_bound)/2) %>%
dplyr::select(location_keys, users) %>%
mutate(key = as.integer(location_keys))
We then conduct our statistical tests for each interest, behavior, and gender. Specifically, we use a t-test to compare the average population of Facebook users between urban and rural regions. The results, including the difference in means and the 95% confidence interval, are saved for further analysis and interpretation. Furthermore, the means and variances of rural, urban, and all regions is saved for each behavior.
nigeria_result_df <- data.frame(Difference_in_Means = numeric(), Lower_CI = numeric(), Upper_CI = numeric(), behavior = character())
means_df <- data.frame(Behavior = character(), Mean = numeric(), Variance = numeric(), Group1_Mean = numeric(), Group1_Var = numeric(), Group2_Mean = numeric(), Group2_Var = numeric())
process_behavior <- function(x, y) {
data <- read.csv(paste0("data/nigeria/", y, ".csv"))
merged_data <- NG_fb_df %>%
mutate(key = as.integer(key)) %>%
left_join(nigeria_filtered_regions, by = "name") %>%
left_join(data, by = c("key" = "location_keys")) %>%
left_join(NG_fb_region_data, "key") %>%
mutate(average_pop = ((estimate_mau_lower_bound + estimate_mau_upper_bound))/2/users) %>%
dplyr::select(key, name, geometry, Cluster, average_pop)
overall_mean <- mean(merged_data$average_pop, na.rm = TRUE)
variance <- var(merged_data$average_pop, na.rm = TRUE)
group1 <- merged_data$average_pop[merged_data$Cluster == 1]
group2 <- merged_data$average_pop[merged_data$Cluster == 2]
means_df <<- rbind(means_df, data.frame(
Behavior = x,
Mean = overall_mean,
Variance = variance,
Group1_Mean = mean(group1, na.rm = TRUE),
Group1_Var = var(group1, na.rm = TRUE),
Group2_Mean = mean(group2, na.rm = TRUE),
Group2_Var = var(group2, na.rm = TRUE)
))
t_test_result <- t.test(group1, group2)
diff_means <- t_test_result$estimate[1] - t_test_result$estimate[2] # Difference in means
conf_interval <- t_test_result$conf.int
nigeria_result_df <<- rbind(nigeria_result_df, data.frame(
Difference_in_Means = diff_means,
Lower_CI = conf_interval[1],
Upper_CI = conf_interval[2],
behavior = x
))
}
mapply(process_behavior, c(selected_behaviors, selected_interests, "Male", "Female"), c(behaviors_id, interests_id, "male", "female"))
nigeria_result_df <- nigeria_result_df %>%
arrange(desc(Difference_in_Means)) %>%
mutate(behavior = factor(behavior, levels = unique(behavior)))
Below, we visualize our results. The right side of the graph represents more urban regions, while the left side corresponds to more rural areas.
Most of these results are as expected. Interests such as Restaurants, Politics, Lives Abroad, Fast Food, and iOS tend to be more prevalent in urban areas, which aligns with common assumptions. However, some findings are quite surprising, such as the higher prevalence of Frequent International Travelers and Early Technology Adopters in rural areas. According to Rama et al. (2020), Early Technology Adopters, while generally leaning towards urban areas, are not exclusively urban, with the interquartile range spanning both urban and rural regions.
nigeria_result_df <- nigeria_result_df %>%
arrange(desc(Difference_in_Means)) %>%
mutate(behavior = factor(behavior, levels = unique(behavior)))
ggplot(nigeria_result_df, aes(x = behavior, y = Difference_in_Means, ymin = Lower_CI, ymax = Upper_CI)) +
geom_pointrange() +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
coord_flip() +
theme_minimal(base_size = 14) +
labs(title = "Differences in Means Between Urban and Rural Regions of Nigeria", x = NULL, y = "Difference in Means (More Rural - More Urban)") +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.text = element_text(size = 14),
axis.title = element_text(size = 16),
legend.title = element_text(size = 16),
legend.text = element_text(size = 18)
)
For reference, here are the overall means and the ranges within one standard deviation for each behavior across all regions. The graphs display the average values of each behavior, with error bars representing the variability. One observation, for example, is that while there is a noticeable bias towards males in rural areas and females in urban areas, both are still male dominant This indicates that rural areas have an overwhelming male presence on Facebook, whereas urban areas show a more balanced, yet still male-dominant, representation.
means_df$Behavior <- str_wrap(means_df$Behavior, width =40)
means_df <- means_df %>%
arrange(Mean) %>%
mutate(Behavior = factor(Behavior, levels = unique(Behavior)))
means_long <- means_df %>%
dplyr::select(Behavior, Mean, Variance, Group1_Mean, Group1_Var, Group2_Mean, Group2_Var) %>%
gather(key = "Group", value = "Value", -Behavior, -Variance, -Group1_Var, -Group2_Var) %>%
mutate(Group_Var = case_when(
Group == "Mean" ~ Variance,
Group == "Group1_Mean" ~ Group1_Var,
Group == "Group2_Mean" ~ Group2_Var
),
Group = recode(Group,
"Mean" = "Overall",
"Group1_Mean" = "Urban",
"Group2_Mean" = "Rural")
)
ggplot(means_long, aes(y = Behavior, x = Value * 100, fill = Group)) +
geom_bar(stat = "identity", position = position_dodge(0.8), width = 0.8) +
geom_errorbar(aes(xmin = (Value - sqrt(Group_Var)) * 100, xmax = (Value + sqrt(Group_Var)) * 100),
position = position_dodge(0.8), width = 0.3) +
scale_fill_manual(values = c("Overall" = "grey", "Urban" = "blue", "Rural" = "red")) +
theme(axis.text.y = element_text(angle = 0, hjust = 1)) +
labs(title = "Overall Mean with Variance and Cluster Means for Each Behavior", x = "Mean (%)", y = NULL) +
scale_x_continuous(labels = scales::percent_format(scale = 1)) +
theme(
axis.text.y = element_text(angle = 0, hjust = 1, vjust = 1),
plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.text = element_text(size = 12),
legend.title = element_text(size = 12),
legend.text = element_text(size = 10)
)
## Average Percent Male Users:
- Urban Areas: 54.44%
- Rural Areas: 65.85%
This vignette demonstrates how private data, specifically Facebook Marketplace data, can be leveraged where public data of communities is lacking or nonexistent. This is particularly pertinent for rural communities where data has historically been inaccurate, incomplete, or lacking. Using rSocialWatcher, we were able to query specific population and interest data at a region level using Facebook Marketplace data. By combining this with existing public data, we were able to derive insights regarding rural and urban differences in Nigeria. We hope to see how you utilize this package in the future for your social science projects.